Policy Approximation and its Advantages
short intro
In policy gradient methods, the policy can be parameterized in any way, as long as it is …
- differentiable with respect to its parameters, i.e., \(\nabla \pi(a|s,\boldsymbol{\theta})\) exists
- with a gradient that is finite for all \(s\in S\), \(a \in A(s)\), and \(\boldsymbol{\theta} \in \mathbb R^{d'}\)
In practice, to ensure exploration we generally require that the policy never becomes deterministic, i.e., \(\pi(a|s,\boldsymbol{\theta}) \in (0,1)\) for all \(s\), \(a\), and \(\boldsymbol{\theta}\).
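To make these requirements concrete, here is a minimal sketch (not from the text) assuming a hypothetical two-action problem with a logistic parameterization over a state feature vector \(\bold{x}(s)\): the gradient has a closed form, is finite for every \(\boldsymbol{\theta}\), and the policy never assigns probability exactly 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def policy(theta, x_s):
    """pi(.|s, theta) for a hypothetical two-action problem,
    parameterized as a logistic function of theta^T x(s)."""
    p_a1 = sigmoid(theta @ x_s)
    return np.array([p_a1, 1.0 - p_a1])   # mathematically always in (0, 1)

def grad_pi_a1(theta, x_s):
    """Analytic gradient of pi(a1|s, theta) w.r.t. theta:
    sigma'(z) = sigma(z) * (1 - sigma(z)), so the gradient
    exists and is finite for every theta."""
    p = sigmoid(theta @ x_s)
    return p * (1.0 - p) * x_s

theta = np.array([0.5, -1.0, 2.0])   # illustrative parameters
x_s   = np.array([1.0, 0.3, -0.7])   # illustrative features for state s

print(policy(theta, x_s))            # two probabilities, strictly between 0 and 1
print(grad_pi_a1(theta, x_s))        # a finite gradient vector
```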
Policy-based methods offer useful ways of dealing with continuous action spaces, as we describe later in Section 13.7.
natural and common kind of parameterization
If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences \(h(s,a,\boldsymbol{\theta}) \in \mathbb{R}\) for each state-action pair.
The actions with the highest preferences in each state \(\to\) the highest probabilities of being selected, e.g., according to an exponential soft-max distribution:
\[\pi(a|s,\boldsymbol{\theta}) \overset{.}{=} \frac{e^{h(s,a,\boldsymbol{\theta})}}{\sum_{b \in A(s)} e^{h(s,b,\boldsymbol{\theta})}}\]
comment : unlike Sutton's textbook, the summation here writes out the set \(A(s)\) that \(b\) ranges over.
We call this kind of policy parameterization soft-max in action preferences.
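As a quick illustration (a sketch, not code from the book), the soft-max turns any vector of preferences \(h(s,\cdot,\boldsymbol{\theta})\) into a probability distribution over the actions available in \(s\); subtracting the maximum preference before exponentiating is a standard numerical-stability trick and leaves the distribution unchanged.

```python
import numpy as np

def softmax_policy(h_values):
    """Soft-max in action preferences: h_values[i] = h(s, a_i, theta).
    Subtracting the max stabilises the exponentials without changing
    the resulting probabilities."""
    z = h_values - np.max(h_values)
    e = np.exp(z)
    return e / e.sum()

# illustrative preferences for three actions available in some state s
h  = np.array([2.0, 0.5, -1.0])
pi = softmax_policy(h)

print(pi)        # approx. [0.79, 0.18, 0.04]: highest preference -> highest probability
print(pi.sum())  # 1.0
```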
The action preferences themselves can be parameterized arbitrarily. For example, they might be computed by a deep artificial neural network (ANN), where \(\boldsymbol{\theta}\) is the vector of all the connection weights of the network. Or the preferences could simply be linear in features, using feature vectors \(\bold{x}(s,a) \in \mathbb{R}^{d'}\) constructed by any of the methods described earlier:
\[h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^T\bold{x}(s,a)\]
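A minimal sketch of the linear case (feature vectors and parameter values below are made up for illustration), stacking the feature vectors \(\bold{x}(s,a)\) for the actions in \(A(s)\) as the rows of a matrix:

```python
import numpy as np

def linear_preferences(theta, x_sa):
    """h(s, a, theta) = theta^T x(s, a) for every action a in A(s);
    x_sa holds the feature vectors x(s, a) as its rows."""
    return x_sa @ theta

def pi_linear_softmax(theta, x_sa):
    """Soft-max in linear action preferences, as in the equations above."""
    h = linear_preferences(theta, x_sa)
    e = np.exp(h - h.max())              # stabilised exponentials
    return e / e.sum()

# three actions in some state s, feature dimension d' = 4 (illustrative)
x_sa  = np.array([[1.0, 0.0, 0.5, 1.0],
                  [0.0, 1.0, 0.5, 1.0],
                  [1.0, 1.0, 0.0, 1.0]])
theta = np.array([0.2, -0.4, 1.0, 0.1])

print(pi_linear_softmax(theta, x_sa))    # a valid probability distribution over the three actions
```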